Let \(\mu_1\) denote the mean repair time for the ILEC customers and \(\mu_2\) the mean repair time for the CLEC customers.
Form the hypotheses. \(H_0: \mu_1 - \mu_2 = 0\) vs. \(H_a: \mu_1 - \mu_2 < 0\)
Conduct the test.
data(Verizon)
library(dplyr)
library(knitr)
ans <- Verizon %>%
group_by(Group) %>%
summarize(mean = mean(Time), N = n())
ans
## # A tibble: 2 x 3
## Group mean N
## <fct> <dbl> <int>
## 1 CLEC 16.5 23
## 2 ILEC 8.41 1664
md = ans[2,2] - ans[1,2]
observed = md$mean
observed
## [1] -8.09752
set.seed(84)
sims <- 10^4-1 # Number of simulations (iterations)
ans <- numeric(sims)
for(i in 1:sims){
index <- sample.int(n = 1687, size = 1664, replace = FALSE) #n = total, size = num in the first group
ans[i] <- mean(Verizon$Time[index]) - mean(Verizon$Time[-index])
}
pvalue <- (sum(ans <= observed) + 1)/(sims + 1)
pvalue
## [1] 0.0169
library(ggplot2)
ggplot(data = data.frame(md = ans), aes(x = md)) +
geom_density(fill = "orange") +
theme_bw() +
labs(x = expression(bar(X)[ILEC]-bar(X)[CLEC])) +
geom_vline(xintercept=observed, linetype="dotted")
Step 5: Make the decision.
Technical conclusion: \(P\)-Value < 0.05, so Reject \(H_0\).
English conclusion: There is enough evidence to conclude that Verizon takes significantly longer, on average, to complete repairs for CLEC customers than for ILEC customers.
ggplot(data = Verizon, mapping = aes(x = Group, y = Time)) +
geom_boxplot()
ggplot(data = Verizon, mapping = aes(x = Time)) +
geom_histogram() + facet_wrap(~Group)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Chapter 6:
Example:
For a fitted slope of \(m\) on bty_avg: on average, when you increase bty_avg by one unit, score will increase by \(m\) units.
In regression we try to predict one variable based on one or more other variables.
Example 3.1:
Why do some professors and instructors at universities and colleges get high teaching evaluations from students while others don’t? What factors can explain these differences?
Researchers at the University of Texas in Austin, Texas (UT Austin) tried to answer this question: what factors can explain differences in instructors’ teaching evaluation scores? To this end, they collected information on \(n=463\) instructors. A full description of the study can be found at openintro.org.
We’ll keep things simple for now and try to explain differences in instructor evaluation scores as a function of one numerical variable: their “beauty score”.
We’ll address these questions by modeling the relationship between these two variables with a particular kind of linear regression called simple linear regression. Simple linear regression is the most basic form of linear regression. With it we have
- a numerical outcome variable \(y\): the instructor’s teaching score, and
- a single numerical explanatory variable \(x\): the instructor’s beauty score.
A crucial step before doing any kind of modeling or analysis is performing an exploratory data analysis, or EDA, of all our data.
Okay… Let’s begin. The dataframe that we are working on is evals and it is in the moderndive library.
Type library(moderndive) in the console and hit return, then type View(evals) in the console and hit return. Also type ?evals in the console to see the description of the dataframe.
library(moderndive)
library(dplyr)
#View(evals) #only works in the console
#?evals
glimpse(evals)
## Observations: 463
## Variables: 14
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ prof_ID <int> 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4, 5, 5,…
## $ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4…
## $ age <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, …
## $ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, …
## $ gender <fct> female, female, female, female, male, male, male, male, …
## $ ethnicity <fct> minority, minority, minority, minority, not minority, no…
## $ language <fct> english, english, english, english, english, english, en…
## $ rank <fct> tenure track, tenure track, tenure track, tenure track, …
## $ pic_outfit <fct> not formal, not formal, not formal, not formal, not form…
## $ pic_color <fct> color, color, color, color, color, color, color, color, …
## $ cls_did_eval <int> 24, 86, 76, 77, 17, 35, 39, 55, 111, 40, 24, 24, 17, 14,…
## $ cls_students <int> 43, 125, 125, 123, 20, 40, 44, 55, 195, 46, 27, 25, 20, …
## $ cls_level <fct> upper, upper, upper, upper, upper, upper, upper, upper, …
Since we are only interested in one \(x\) variable, namely, bty_avg, let’s select only the \(y\) variable, score and bty_avg.
evals_onex <- evals %>%
select(score, bty_avg)
# View(evals_onex) # Only In the console, DO NOT run the View command here
glimpse(evals_onex)
## Observations: 463
## Variables: 2
## $ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4…
## $ bty_avg <dbl> 5.000, 5.000, 5.000, 5.000, 3.000, 3.000, 3.000, 3.333, 3.333…
Since both the outcome variable score and the explanatory variable bty_avg are numerical, we can compute summary statistics about them such as the mean, median, and standard deviation.
Let’s pipe this into the skim() function from the skimr package. This function quickly returns the following summary information about each variable.
library(skimr)
evals %>%
select(score, bty_avg) %>%
skim()
| Name | Piped data |
| Number of rows | 463 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| score | 0 | 1 | 4.17 | 0.54 | 2.30 | 3.80 | 4.30 | 4.6 | 5.00 | ▁▁▅▇▇ |
| bty_avg | 0 | 1 | 4.42 | 1.53 | 1.67 | 3.17 | 4.33 | 5.5 | 8.17 | ▃▇▇▃▂ |
#Or, since we have already selected the variables, skim(evals_onex) would also work
Here, p0, for example, is the 0th percentile: the value below which 0% of observations fall. This is also known as the minimum.
According to the histograms: variable score is skewed to the left, and the variable bty_avg is skewed to the right.
We get an idea of how the values in both variables are distributed. For example, the mean teaching score was 4.17 out of 5 whereas the mean beauty score was 4.42 out of 10. Furthermore, the middle 50% of teaching scores were between 3.8 and 4.6 out of 5, while the middle 50% of beauty scores were between 3.17 and 5.5 out of 10.
Since we are considering the relationship between two numerical variables, it would be nice to have a summary statistic that simultaneously considers both variables. The correlation coefficient is a bivariate summary statistic that fits this bill.
A correlation coefficient is a quantitative expression between -1 and 1 that summarizes the strength of the linear relationship between two numerical variables:
The following figure gives examples of different correlation coefficient values for hypothetical numerical variables \(x\) and \(y\).
We see that while for a correlation coefficient of -0.75 there is still a negative relationship between \(x\) and \(y\), it is not as strong as the negative relationship between \(x\) and \(y\) when the correlation coefficient is -1.
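One way to build intuition for these values is to simulate pairs of variables with a chosen correlation. A minimal sketch (an assumption-free construction: \(y = r\,x + \sqrt{1-r^2}\,e\), with \(x\) and \(e\) independent standard normals, has population correlation \(r\) with \(x\)):

```r
# Simulate x, y pairs with a target correlation r and compare to cor().
set.seed(1)
n <- 10^5
x <- rnorm(n)
e <- rnorm(n)
for (r in c(-1, -0.75, 0, 0.75, 1)) {
  y <- r * x + sqrt(1 - r^2) * e
  cat("target r =", r, "  sample r =", round(cor(x, y), 3), "\n")
}
```

With \(n = 10^5\) draws the sample correlations land very close to their targets, and the extremes \(r = \pm 1\) are exact, since \(y\) is then a deterministic linear function of \(x\).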
The correlation coefficient is computed using the get_correlation() function in the moderndive package. Here is the syntax:
#your_dataframe %>%
# get_correlation(formula = response_variable ~ explanatory_variable)
evals_onex %>%
get_correlation(formula = score ~ bty_avg)
## # A tibble: 1 x 1
## cor
## <dbl>
## 1 0.187
Another way to get the correlations is:
cor(x = evals_onex$bty_avg, y = evals_onex$score)
## [1] 0.1871424
In our case, the correlation coefficient of 0.187 indicates that the linear relationship between teaching evaluation score and beauty average is “weakly positive.”
Properties of \(r\)
Since both the score and bty_avg variables are numerical, a scatter plot is an appropriate graph to visualize this data.
library(ggplot2)
ggplot(data = evals_onex, mapping = aes(x = bty_avg, y = score)) +
geom_jitter() +   # jitter the points; adding geom_point() as well would plot each point twice
labs(x = "Beauty Score", y = "Teaching Score", title = "Relationship Between Teaching and Beauty Scores") +
geom_smooth(method = "lm")
#compute summary statistics
evals %>%
select(score, age) %>%
skim()
| Name | Piped data |
| Number of rows | 463 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| score | 0 | 1 | 4.17 | 0.54 | 2.3 | 3.8 | 4.3 | 4.6 | 5 | ▁▁▅▇▇ |
| age | 0 | 1 | 48.37 | 9.80 | 29.0 | 42.0 | 48.0 | 57.0 | 73 | ▅▆▇▆▁ |
# According to the histograms: variable score is skewed to the left, and the variable age is roughly symmetric
#The mean teaching score was 4.17 out of 5 whereas the mean age was 48.37
#The median teaching score was 4.3 out of 5 whereas the median age was 48
#Furthermore, the middle 50% of teaching scores were between 3.8 and 4.6
# while the middle 50% of ages were between 42 and 57
#get correlation
cor(x = evals$age, y = evals$score)
## [1] -0.107032
#score and age have a slight negative correlation
#older teachers have slightly lower teaching scores than younger teachers, on average
#visualization
ggplot(data = evals, mapping = aes(x = age, y = score)) +
geom_jitter() +
labs(x = "Age", y = "Teaching Score", title = "Relationship Between Teaching Scores and Age") +
geom_smooth(method = "lm")
The equation of the regression line is \(\hat{y} = b_0+b_1\cdot x\) where…
The vertical distances from the points to the line are called the residuals. We obtain the line by minimizing the sum of the squares of these distances. If we denote the observed points by \(y_i\) and the points on the line by \(\hat{y_i}\), then the residuals can be denoted by \(y_i-\hat{y_i}\). So we obtain the line (that is, \(b_0\) and \(b_1\)) by minimizing the following quantity:
\(\texttt{Residual sum of squares} = \sum(y_i-\hat{y_i})^2\)
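The least-squares coefficients that minimize this quantity have closed forms in terms of summary statistics: \(b_1 = r \cdot s_y/s_x\) and \(b_0 = \bar{y} - b_1\bar{x}\). A self-contained check using R’s built-in cars data (any numeric \(x\), \(y\) pair would do):

```r
# Compute b0, b1 from summary statistics and compare with lm().
x <- cars$speed
y <- cars$dist
b1 <- cor(x, y) * sd(y) / sd(x)   # slope: r * s_y / s_x
b0 <- mean(y) - b1 * mean(x)      # intercept: line passes through (xbar, ybar)
fit <- lm(dist ~ speed, data = cars)
round(c(b0, b1), 4)               # hand-computed coefficients
round(unname(coef(fit)), 4)       # matches lm()'s coefficients
```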
The lm() function that “fits” the linear regression model is typically used as lm(y ~ x, data = data_frame_name) where:
- y is the outcome variable, followed by a tilde (~). In our case, y is set to score.
- x is the explanatory variable. In our case, x is set to bty_avg. We call the combination y ~ x a model formula.
- data_frame_name is the name of the data frame that contains the variables y and x. In our case, data_frame_name is the evals_onex data frame.
score_model <- lm(score ~ bty_avg, data = evals_onex) #y ~ x
score_model
##
## Call:
## lm(formula = score ~ bty_avg, data = evals_onex)
##
## Coefficients:
## (Intercept) bty_avg
## 3.88034 0.06664
This output is telling us that the Intercept coefficient \(b_0\) of the regression line is 3.8803 and the slope coefficient (\(b_1\)) for bty_avg is 0.0666. Therefore the blue regression line is (line in the previous figure)
\(\hat{\text{score}} = b_0 + b_1 \cdot \text{bty_avg}\)
\(\hat{\text{score}} = 3.8803 + 0.0666 \cdot \text{bty_avg}\)
Interpretations of \(b_0\) and \(b_1\):
For instructors with a beauty score of 0, we would expect on average a teaching score of 3.8803. In this case the interpretation of \(b_0\) is meaningless. (Why?)
Range <- range(evals_onex$bty_avg)
Range
## [1] 1.667 8.167
A bty_avg of 0 is outside the range of observed bty_avg values.
The slope coefficient \(b_1=+0.0666\):
This is a numerical quantity that summarizes the relationship between the outcome and explanatory variables. Note that the sign is positive, suggesting a positive relationship between beauty scores and teaching scores: as beauty scores go up, so do teaching scores. The slope’s precise interpretation is:
For every 1-unit increase in bty_avg, there is an associated increase of, on average, 0.0666 (\(b_1\)) units of score.
score1 = 3.8803 + 0.0666 * (7.5)
score1
## [1] 4.3798
\(\hat{score} = 3.8803 + 0.0666 \times 7.5 = 4.3798\)
Extrapolation is using the regression to predict \(y\) from \(x\) outside the range of x values. Such predictions are often not accurate.
score2 = 3.8803 + 0.0666 * (9.1)
score2
## [1] 4.48636
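Rather than plugging numbers in by hand, R’s predict() applies the fitted equation to new data; with the actual model the call would be predict(score_model, newdata = data.frame(bty_avg = c(7.5, 9.1))). A self-contained sketch using a toy model built on the reported coefficients:

```r
# Toy data lying exactly on the reported line, so lm() recovers
# b0 = 3.8803 and b1 = 0.0666, and predict() reproduces the hand
# computations above.
toy <- data.frame(bty_avg = 1:10)
toy$score <- 3.8803 + 0.0666 * toy$bty_avg
m <- lm(score ~ bty_avg, data = toy)
predict(m, newdata = data.frame(bty_avg = c(7.5, 9.1)))
# 4.3798 and 4.48636, matching the values computed above
```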
Fit a new simple linear regression for score where age is the new explanatory variable \(x\).
Interpret the regression coefficients.
age_model = lm(score ~ age, data = evals)
age_model
##
## Call:
## lm(formula = score ~ age, data = evals)
##
## Coefficients:
## (Intercept) age
## 4.461932 -0.005938
\(\hat{score} = 4.461932 - 0.005938 \cdot age\)
Let’s only consider one observation: For example, say we are interested in the 21st instructor in this dataset:
kable(evals_onex[21,])
| score | bty_avg |
|---|---|
| 4.9 | 7.333 |
Here in this example: \(\hat{y} = 3.8803 + 0.0666 \cdot 7.333 \approx 4.369\), so the residual is \(y - \hat{y} = 4.9 - 4.369 = 0.531\).
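The fitted value and residual for this one observation can be computed directly from the reported coefficients:

```r
# 21st instructor: score = 4.9, bty_avg = 7.333 (from the table above)
b0 <- 3.8803
b1 <- 0.0666
score_hat <- b0 + b1 * 7.333   # fitted value on the regression line
residual  <- 4.9 - score_hat   # observed minus fitted
round(c(score_hat = score_hat, residual = residual), 3)
# score_hat ~ 4.369, residual ~ 0.531
```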
Now we need to find the residuals for all the observations (not just one). R can do it for us…
regression_points <- get_regression_points(score_model)
regression_points
## # A tibble: 463 x 5
## ID score bty_avg score_hat residual
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 4.7 5 4.21 0.486
## 2 2 4.1 5 4.21 -0.114
## 3 3 3.9 5 4.21 -0.314
## 4 4 4.8 5 4.21 0.586
## 5 5 4.6 3 4.08 0.52
## 6 6 4.3 3 4.08 0.22
## 7 7 2.8 3 4.08 -1.28
## 8 8 4.1 3.33 4.10 -0.002
## 9 9 3.4 3.33 4.10 -0.702
## 10 10 4.5 3.17 4.09 0.409
## # … with 453 more rows
The plot of residuals against the fitted values (\(\hat{y_i}\) on the horizontal axis, \(y_i-\hat{y_i}\) on the vertical) provides information on the appropriateness of a straight-line model. Ideally, points should be scattered randomly about the reference line \(y=0\). See the following figure.
Residual plots are useful for the following:
Example: Create a residual plot for the score_model and interpret the residual plot.
ggplot(score_model, aes(x = .fitted, y = .resid)) + geom_point()
Plotted points are scattered randomly (no pattern) about the reference line \(residual=0\), and no outliers are detected. Therefore the residual plot indicates that the relationship between bty_avg and score is linear.
Example: Create a residual plot for the model where y = score and x = age, and interpret the residual plot.
ggplot(age_model, aes(x = .fitted, y = .resid)) + geom_point()
When the explanatory variable \(x\) is categorical, the concept of a “best-fitting” line is a little different than the one we saw in the previous Section where the explanatory variable \(x\) was numerical.
We use the following example to study this.
It’s an unfortunate truth that life expectancy is not the same across various countries in the world; there are a multitude of factors that are associated with how long people live. International development agencies are very interested in studying these differences in the hope of understanding where governments should allocate resources to address this problem. In this section, we’ll explore differences in life expectancy in two ways:
To answer such questions, we’ll study the gapminder dataset in the gapminder package. This dataset has international development statistics such as life expectancy, GDP per capita, and population by country (\(n = 142\)) for 5-year intervals between 1952 and 2007.
We’ll use this data for linear regression again, but note that our explanatory variable \(x\) is now categorical, and not numerical like when we covered simple linear regression in Section 6.1. More precisely, we have:
As always, the first step in model building is…
Type View(gapminder) in the console.
Let’s load the gapminder data and filter() for only observations in 2007. Next we select() only the variables country, continent, and lifeExp, along with gdpPercap, which is each country’s gross domestic product per capita (GDP). GDP is a rough measure of a country’s economic performance. Lastly, we save this in a data frame named gapminder2007:
library(gapminder)
data(gapminder)
str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num 28.8 30.3 32 34 36.1 ...
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
gapminder2007 = gapminder %>%
filter(year == 2007) %>%
select(country, continent, lifeExp, gdpPercap)
glimpse(gapminder2007)
## Observations: 142
## Variables: 4
## $ country <fct> Afghanistan, Albania, Algeria, Angola, Argentina, Australia…
## $ continent <fct> Asia, Europe, Africa, Africa, Americas, Oceania, Europe, As…
## $ lifeExp <dbl> 43.828, 76.423, 72.301, 42.731, 75.320, 81.235, 79.829, 75.…
## $ gdpPercap <dbl> 974.5803, 5937.0295, 6223.3675, 4797.2313, 12779.3796, 3443…
Notice that variable continent is indeed categorical.
Let’s apply the skim() function from the skimr package to our two variables of interest: continent and lifeExp:
library(skimr)
gapminder2007 %>%
select(continent, lifeExp) %>%
skim()
| Name | Piped data |
| Number of rows | 142 |
| Number of columns | 2 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| continent | 0 | 1 | FALSE | 5 | Afr: 52, Asi: 33, Eur: 30, Ame: 25 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| lifeExp | 0 | 1 | 67.01 | 12.07 | 39.61 | 57.16 | 71.94 | 76.41 | 82.6 | ▂▃▃▆▇ |
Given that the global median life expectancy is 71.94, half of the world’s countries (71 countries) have a life expectancy less than 71.94, and half have a life expectancy greater than this value. The mean life expectancy of 67.01 is lower, however. Why are these two values different? Let’s look at a histogram of lifeExp to see why.
ggplot(data = gapminder2007, mapping = aes(x = lifeExp)) +
geom_histogram(binwidth = 5, color='white', fill='orchid4')
We see that this data is skewed to the left: there are a few countries with very low life expectancy that are bringing down the mean life expectancy. However, the median is less sensitive to the effects of such outliers. Hence the median is greater than the mean in this case.
tbl = gapminder2007 %>%
group_by(continent) %>%
summarize(median = median(lifeExp), mean = mean(lifeExp), N = n())
kable(tbl)
| continent | median | mean | N |
|---|---|---|---|
| Africa | 52.9265 | 54.80604 | 52 |
| Americas | 72.8990 | 73.60812 | 25 |
| Asia | 72.3960 | 70.72848 | 33 |
| Europe | 78.6085 | 77.64860 | 30 |
| Oceania | 80.7195 | 80.71950 | 2 |
We see now that there are differences in life expectancy between the continents. For example let’s focus on only the medians. While the median life expectancy across all \(n=142\) countries in 2007 was \(71.935\), the median life expectancy across the \(n=52\) countries in Africa was only \(52.93\).
Let’s create a corresponding visualization.
ggplot(data = gapminder2007, mapping = aes(x = lifeExp, fill = factor(continent))) +
geom_histogram(binwidth = 5, color = 'white') +
labs(x = "Life Expectancy", y = "Number of Countries", title = "Life Expectancy by Continent") +
facet_wrap(~ continent, nrow = 2)
ggplot(data = gapminder2007, aes(x = continent, y = lifeExp)) +
geom_boxplot() +
labs(x = "Continent", y = "Life Expectancy", title = "Life Expectancy by Continent")
Now, let’s start making comparisons of life expectancy between continents.
First we have to pick a baseline to compare with. Let’s use Africa.
In section 6.1 we introduced simple linear regression, which involves modeling the relationship between a numerical outcome variable \(y\) as a function of a numerical explanatory variable \(x\), in our life expectancy example, we now have a categorical explanatory variable \(x\) continent.
While we still can fit a regression model, given our categorical explanatory variable we no longer have a concept of a “best-fitting” line, but rather “differences relative to a baseline for comparison.”
Before we fit our regression model, let’s create a table similar to the previous table, but…
meanAfrica = 54.806
tbl = gapminder2007 %>%
group_by(continent) %>%
summarize(mean = round(mean(lifeExp),3), vsAfrica = round(mean(lifeExp),3)-meanAfrica)
kable(tbl)
| continent | mean | vsAfrica |
|---|---|---|
| Africa | 54.806 | 0.000 |
| Americas | 73.608 | 18.802 |
| Asia | 70.728 | 15.922 |
| Europe | 77.649 | 22.843 |
| Oceania | 80.719 | 25.913 |
Now, let’s use the lm function we introduced in Section 6.1 to get the regression coefficients for gapminder2007 analysis:
lifeExp_model <- lm(lifeExp ~ continent, data = gapminder2007)
lifeExp_model
##
## Call:
## lm(formula = lifeExp ~ continent, data = gapminder2007)
##
## Coefficients:
## (Intercept) continentAmericas continentAsia continentEurope
## 54.81 18.80 15.92 22.84
## continentOceania
## 25.91
summary(lifeExp_model)
##
## Call:
## lm(formula = lifeExp ~ continent, data = gapminder2007)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.9005 -4.0399 0.2565 3.3840 21.6360
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.806 1.025 53.446 < 2e-16 ***
## continentAmericas 18.802 1.800 10.448 < 2e-16 ***
## continentAsia 15.922 1.646 9.675 < 2e-16 ***
## continentEurope 22.843 1.695 13.474 < 2e-16 ***
## continentOceania 25.913 5.328 4.863 3.12e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.395 on 137 degrees of freedom
## Multiple R-squared: 0.6355, Adjusted R-squared: 0.6249
## F-statistic: 59.71 on 4 and 137 DF, p-value: < 2.2e-16
What are these values? First, we must describe the equation for fitted value \(\hat{y}\), which is a little more complicated when the \(x\) explanatory variable is categorical:
\(\hat{\text{life exp}} = b_0 + b_{Amer} \cdot 1_{Amer}(x) + b_{Asia} \cdot 1_{Asia}(x) + b_{Euro} \cdot 1_{Euro}(x) + b_{Ocean} \cdot 1_{Ocean}(x)\)
\(\hat{\text{life exp}} = 54.81 + 18.80 \cdot 1_{Amer}(x) + 15.92 \cdot 1_{Asia}(x) + 22.84 \cdot 1_{Euro}(x) + 25.91 \cdot 1_{Ocean}(x)\)
What does \(1_{A}(x)\) mean? (from your previous math course)
In mathematics this is known as an “indicator function” that takes one of two possible values:
\(1_{A}(x) = \begin{cases} 1 & x \in A \\ 0 & x \notin A \end{cases}\)
In a statistical modeling context this is also known as a “dummy variable”. In our case, let’s consider the first such indicator variable:
\(1_{Amer}(x) = \begin{cases} 1 & \text{if country x is in the Americas} \\ 0 & \text{otherwise} \end{cases}\)
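R builds these dummy variables automatically whenever a factor appears in a model formula; model.matrix() shows the encoding. A small sketch with one made-up observation per continent:

```r
# Each non-baseline level gets a 0/1 dummy column; the baseline level
# (Africa, first alphabetically) is absorbed into the intercept.
cont <- factor(c("Africa", "Americas", "Asia", "Europe", "Oceania"))
model.matrix(~ cont)
```

The resulting matrix has an intercept column of 1s plus four indicator columns (contAmericas, contAsia, contEurope, contOceania), which is exactly the design matrix lm() used above.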
Now let’s interpret the regression coefficients.
The fitted value yielded by this equation for a country in the Americas is (in this case only the indicator function \(1_{Amer}(x)\) is equal to 1; all others are 0):
\(\hat{\text{life exp}} = 54.81 + 18.80 \cdot 1_{Amer}(x) + 15.92 \cdot 1_{Asia}(x) + 22.84 \cdot 1_{Euro}(x) + 25.91 \cdot 1_{Ocean}(x)\)
\(\hat{\text{life exp}} = 54.81 + 18.80 \cdot 1 + 15.92 \cdot 0 + 22.84 \cdot 0 + 25.91 \cdot 0\)
\(\hat{\text{life exp}} = 54.81 + 18.80\)
\(\hat{\text{life exp}} = 73.61\)
Furthermore, this value corresponds to the group mean life expectancy for all American countries.
The rest of the coefficients can be interpreted the same way.
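As a quick check, adding each offset to the baseline (Africa) intercept reproduces the group means from the earlier table:

```r
# Baseline intercept plus each continent's offset = that continent's mean
baseline <- 54.806
offsets  <- c(Americas = 18.802, Asia = 15.922,
              Europe = 22.843, Oceania = 25.913)
round(baseline + offsets, 3)
# Americas 73.608, Asia 70.728, Europe 77.649, Oceania 80.719
```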
Let’s generalize…
We’ll use the Credit dataframe from the ISLR package to demonstrate multiple regression with:
- Credit data: use the View command to look at the raw data.
- select() only the Balance, Limit, Income, Rating and Age variables. (We will be using Rating and Age in a forthcoming exercise.)
library(ISLR)
data(Credit)
#View(Credit)
glimpse(Credit)
## Observations: 400
## Variables: 12
## $ ID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ Income <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.996, …
## $ Limit <int> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6819,…
## $ Rating <int> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 138,…
## $ Cards <int> 2, 3, 4, 3, 2, 4, 2, 2, 5, 3, 4, 3, 1, 1, 2, 3, 3, 3, 1, 2,…
## $ Age <int> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, 75,…
## $ Education <int> 11, 15, 11, 11, 16, 10, 12, 9, 13, 19, 14, 16, 7, 9, 13, 15…
## $ Gender <fct> Male, Female, Male, Female, Male, Male, Female, Male, …
## $ Student <fct> No, Yes, No, No, No, No, No, No, No, Yes, No, No, No, No, N…
## $ Married <fct> Yes, Yes, No, No, Yes, No, No, No, No, Yes, Yes, No, Yes, Y…
## $ Ethnicity <fct> Caucasian, Asian, Asian, Asian, Caucasian, Caucasian, Afric…
## $ Balance <int> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407, 0…
credit = Credit %>% select(Balance, Limit, Income, Rating, Age)
glimpse(credit)
## Observations: 400
## Variables: 5
## $ Balance <int> 333, 903, 580, 964, 331, 1151, 203, 872, 279, 1350, 1407, 0, …
## $ Limit <int> 3606, 6645, 7075, 9504, 4897, 8047, 3388, 7114, 3300, 6819, 8…
## $ Income <dbl> 14.891, 106.025, 104.593, 148.924, 55.882, 80.180, 20.996, 71…
## $ Rating <int> 283, 483, 514, 681, 357, 569, 259, 512, 266, 491, 589, 138, 3…
## $ Age <int> 34, 82, 71, 36, 68, 77, 37, 87, 66, 41, 30, 64, 57, 49, 75, 5…
Let’s look at some summary statistics for the variables that we need for the problem at hand.
library(skimr)
library(ggplot2)
library(cowplot)
credit %>% skim()
| Name | Piped data |
| Number of rows | 400 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| numeric | 5 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Balance | 0 | 1 | 520.01 | 459.76 | 0.00 | 68.75 | 459.50 | 863.00 | 1999.00 | ▇▅▃▂▁ |
| Limit | 0 | 1 | 4735.60 | 2308.20 | 855.00 | 3088.00 | 4622.50 | 5872.75 | 13913.00 | ▆▇▃▁▁ |
| Income | 0 | 1 | 45.22 | 35.24 | 10.35 | 21.01 | 33.12 | 57.47 | 186.63 | ▇▂▁▁▁ |
| Rating | 0 | 1 | 354.94 | 154.72 | 93.00 | 247.25 | 344.00 | 437.25 | 982.00 | ▆▇▃▁▁ |
| Age | 0 | 1 | 55.67 | 17.25 | 23.00 | 41.75 | 56.00 | 70.00 | 98.00 | ▆▇▇▇▁ |
pAge = ggplot(data = credit, mapping = aes(x = Age)) +
geom_histogram(binwidth = 10, fill = 'red2', color = 'red4')
pBalance = ggplot(data = credit, mapping = aes(x = Balance)) +
geom_histogram(binwidth = 200, fill = 'blue2', color = 'blue4')
pLimit = ggplot(data = credit, mapping = aes(x = Limit)) +
geom_histogram(binwidth = 1000, fill = 'yellow2', color = 'yellow4')
pRating = ggplot(data = credit, mapping = aes(x = Rating)) +
geom_histogram(binwidth = 100, fill = 'green2', color = 'green4')
plot_grid(pAge, pBalance, pLimit, pRating)
Let’s also look at histograms as visual aids.
We observe for example:
Since our outcome variable Balance and the explanatory variables Limit and Income are numerical, we can and have to compute the correlation coefficient between pairs of these variables before we proceed to build a model.
credit %>%
select(Balance, Limit, Income) %>%
cor()
## Balance Limit Income
## Balance 1.0000000 0.8616973 0.4636565
## Limit 0.8616973 1.0000000 0.7920883
## Income 0.4636565 0.7920883 1.0000000
- The correlation of Balance with Limit is \(0.862\). This indicates a strong positive linear relationship, which makes sense as only individuals with large credit limits can accrue large credit card balances.
- The correlation of Balance with Income is \(0.464\). This is suggestive of another positive linear relationship, although not as strong as the relationship between Balance and Limit.
- The correlation between Limit and Income is \(0.792\). In this case, we say there is a high degree of collinearity between these two explanatory variables.
Note: Collinearity (or multicollinearity) is a phenomenon in which one explanatory variable in a multiple regression model can be linearly predicted from the others with a substantial degree of accuracy. So in this case, since Limit and Income are highly correlated, if we knew someone’s credit card Limit we could make a fairly accurate guess as to that person’s Income. Or put loosely, these two variables provide redundant information. For now let’s ignore any issues related to collinearity and press on.
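Collinearity can be seen in miniature with simulated data: when one predictor is largely a linear function of another, the R² from regressing one on the other equals the squared correlation between them (a sketch with made-up variable names x1 and x2):

```r
# Build x2 to be strongly correlated with x1, then check R^2 = r^2
set.seed(7)
x1 <- rnorm(200)
x2 <- 0.8 * x1 + 0.6 * rnorm(200)       # x2 is largely predictable from x1
r  <- cor(x1, x2)
r2 <- summary(lm(x2 ~ x1))$r.squared
c(correlation = r, r_squared = r2, r_squared_check = r^2)
```

The closer this R² is to 1, the more redundant the two predictors are, which is the situation Limit and Income approach with their correlation of 0.792.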
Let’s visualize the relationship of the outcome variable with each of the two explanatory variables in two separate plots:
To get a sense of the joint relationship of all three variables simultaneously through a visualization, let’s display the data in a 3-dimensional (3D) scatterplot, where
- Balance is on the \(z\)-axis (vertical axis).
- Income is on one of the floor axes.
- Limit is on the other floor axis.
library(plotly)
p <- plot_ly(data = Credit, z = ~Balance, x = ~Income, y = ~Limit, opacity = 0.6, color = Credit$Balance) %>%
add_markers()
p
Conduct a new exploratory data analysis with the same outcome variable \(y\) being Balance but with Rating and Age as the new explanatory variables \(x_1\) and \(x_2\). Remember, this involves three things:
What can you say about the relationship between a credit card holder’s balance and their credit rating and age?
credit %>%
select(Balance, Rating, Age) %>%
cor()
## Balance Rating Age
## Balance 1.000000000 0.8636252 0.001835119
## Rating 0.863625161 1.0000000 0.103164996
## Age 0.001835119 0.1031650 1.000000000
Balance with Rating is 0.864. This indicates a strong positive linear relationship: individuals with higher credit ratings tend to carry larger balances. Balance with Age is 0.002, essentially zero, suggesting virtually no linear relationship between balance and age. As an added bonus, we can read off the correlation coefficient of the two explanatory variables, Rating and Age, which is only 0.103. Unlike Limit and Income, these two explanatory variables exhibit little collinearity.
We now use a + to consider multiple explanatory variables. Here is the syntax:
model_name <- lm(y ~ x1 + x2 + ... +xn, data = data_frame_name)
Balance_model <- lm(Balance ~ Limit + Income, data = Credit)
Balance_model
##
## Call:
## lm(formula = Balance ~ Limit + Income, data = Credit)
##
## Coefficients:
## (Intercept) Limit Income
## -385.1793 0.2643 -7.6633
# Or use one of the followings to see more info...
library(moderndive)
get_regression_table(Balance_model)
## # A tibble: 3 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept -385. 19.5 -19.8 0 -423. -347.
## 2 Limit 0.264 0.006 45.0 0 0.253 0.276
## 3 Income -7.66 0.385 -19.9 0 -8.42 -6.91
#summary(Balance_model)
Model: \(\hat{Balance}= -385 + 0.264 \cdot Limit - 7.66 \cdot Income\)
How do we interpret these three values that define the regression plane?
Intercept: \(-\$385.18\) (rounded to two decimal points to represent cents). The intercept in our case represents the credit card balance for an individual who has both a credit Limit of \(\$0\) and Income of \(\$0\).
- Limit: \(\$0.26\). Now that we have multiple variables to consider, we have to add a caveat to our interpretation: holding all the other variables fixed (Income, in this case), for every increase of one unit in credit Limit (one dollar), there is an associated increase of, on average, \(\$0.26\) in credit card balance.
- Income: \(-\$7.66\). Similarly, holding all the other variables fixed (Limit, in this case), for every increase of one unit in Income (Income is recorded in units of \(\$1{,}000\)), there is an associated decrease of, on average, \(\$7.66\) in credit card balance.
As we did previously, let’s look at the fitted values and residuals.
regression_points <- get_regression_points(Balance_model)
regression_points
## # A tibble: 400 x 6
## ID Balance Limit Income Balance_hat residual
## <int> <int> <int> <dbl> <dbl> <dbl>
## 1 1 333 3606 14.9 454. -121.
## 2 2 903 6645 106. 559. 344.
## 3 3 580 7075 105. 683. -103.
## 4 4 964 9504 149. 986. -21.7
## 5 5 331 4897 55.9 481. -150.
## 6 6 1151 8047 80.2 1127. 23.6
## 7 7 203 3388 21.0 349. -146.
## 8 8 872 7114 71.4 948. -76.0
## 9 9 279 3300 15.1 371. -92.2
## 10 10 1350 6819 71.1 873. 477.
## # … with 390 more rows
ggplot(Balance_model, aes(x = .fitted, y = .resid)) + geom_point()
Assuming the model is good…
Kevin has a credit limit of $5080 and his income is $150,000. Use the Balance_model to predict Kevin’s balance.
newx <- data.frame(Limit = _____, Income = ____)
predicted_balance <- predict(Balance_model, newx)
predicted_balance
newx = data.frame(Limit = c(5080,4090), Income = c(15,30))
predicted_balance = predict(Balance_model, newx)
predicted_balance
## 1 2
## 842.6244 465.9962
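To see what predict() is doing, we can plug the fitted coefficients into the regression equation by hand. This is a minimal sketch using the (rounded) coefficients reported above; small differences from predict() come from rounding.

```r
# Coefficients from the fitted model Balance ~ Limit + Income (rounded)
b0 <- -385.1793  # intercept
b1 <- 0.2643     # Limit coefficient
b2 <- -7.6633    # Income coefficient (Income is in thousands of dollars)

# Prediction for the first new observation: Limit = 5080, Income = 15
by_hand <- b0 + b1 * 5080 + b2 * 15
by_hand  # close to the 842.62 returned by predict()
```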
Let’s revisit the instructor evaluation data introduced in Ch 6.
Consider a modeling scenario with:
age and gender as explanatory variables.
Let's reload the evals data and select() only the needed subset of variables. Note that these are different from the variables chosen in Chapter 6. Let's give this the name evals_ch7.
Compute summary statistics using the skim() function from the skimr package:
library(moderndive)
library(skimr)
data(evals)
evals_ch7 = evals %>% select(age, gender, score)
glimpse(evals_ch7)
## Observations: 463
## Variables: 3
## $ age <int> 36, 36, 36, 36, 59, 59, 59, 51, 51, 40, 40, 40, 40, 40, 40, 40…
## $ gender <fct> female, female, female, female, male, male, male, male, male, …
## $ score <dbl> 4.7, 4.1, 3.9, 4.8, 4.6, 4.3, 2.8, 4.1, 3.4, 4.5, 3.8, 4.5, 4.…
evals_ch7 %>% skim()
| Name | Piped data |
| Number of rows | 463 |
| Number of columns | 3 |
| _______________________ | |
| Column type frequency: | |
| factor | 1 |
| numeric | 2 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| gender | 0 | 1 | FALSE | 2 | mal: 268, fem: 195 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 48.37 | 9.80 | 29.0 | 42.0 | 48.0 | 57.0 | 73 | ▅▆▇▆▁ |
| score | 0 | 1 | 4.17 | 0.54 | 2.3 | 3.8 | 4.3 | 4.6 | 5 | ▁▁▅▇▇ |
Compute the correlation coefficient between score and age. Recall that correlation coefficients only exist between numerical variables.
evals_ch7 %>%
get_correlation(formula = score ~ age)
## # A tibble: 1 x 1
## cor
## <dbl>
## 1 -0.107
We observe that age and score are weakly, negatively correlated.
Now, let’s try to visualize the correlation.
Create a scatterplot of score over age. Use the binary gender variable to color the points with two colors. Add a regression line (or two) to your scatterplot.
ggplot(data = evals_ch7, mapping = aes(x = age, y = score, color = gender)) +
geom_point(alpha = 0.5) +
labs(x = "Age", y = "Teaching Score", title = "Using method = lm") +
geom_smooth(method = "lm", se = FALSE)
Much like we started to consider multiple explanatory variables using the + sign in the previous section, let’s fit a regression model and get the regression table.
score_model_ch7 = lm(score ~ age + gender, data = evals_ch7)
score_model_ch7
##
## Call:
## lm(formula = score ~ age + gender, data = evals_ch7)
##
## Coefficients:
## (Intercept) age gendermale
## 4.484116 -0.008678 0.190571
#get_regression_table(score_model_ch7)
Full: \(\hat{Score} = 4.48 - 0.009 \cdot age + 0.191 \cdot 1_{Male}(x)\)
Male: \(\hat{Score_M} = 4.675 - 0.009 \cdot age\)
Female: \(\hat{Score_F} = 4.48 - 0.009 \cdot age\)
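The male and female equations above come directly from the fitted coefficients: the gendermale dummy shifts the intercept while the age slope is shared. A quick sketch of the arithmetic:

```r
# Coefficients from score ~ age + gender (parallel slopes model)
b <- c(intercept = 4.484116, age = -0.008678, gendermale = 0.190571)

female_intercept <- b[["intercept"]]                       # 1_{Male}(x) = 0
male_intercept   <- b[["intercept"]] + b[["gendermale"]]   # 1_{Male}(x) = 1
male_intercept  # about 4.675; both groups share the slope of about -0.009
```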
Let’s create the scatterplot of score over age again. Use the binary gender variable to color the points with two colors. Add regression lines to your scatterplot, but use the model(s) we created.
ggplot(data = evals_ch7, mapping = aes(x = age, y = score, color = gender)) +
geom_point(alpha = 0.5) +
labs(x = "Age", y = "Teaching Score", title = "Using Parallel Slopes") +
geom_abline(intercept = 4.48, slope = -0.009, color = "tomato", lwd=1) +
geom_abline(intercept = 4.675, slope = -0.009, color = "mediumturquoise", lwd=1)
Interpretations of the coefficients:
We say a model has an interaction effect if the associated effect of one variable depends on the value of another variable. These types of models usually prove tricky to interpret at first glance because of their complexity. In this case, the effect of age will depend on the value of gender (as was suggested by the different slopes for men and women in our visual exploratory data analysis).
Let’s fit a regression with an interaction term. We add an interaction term using the * sign. Let’s fit this regression and save it in score_model_interaction, then we get the regression table using the get_regression_table() function as before.
score_model_interaction <- lm(score ~ age + gender + age * gender, data = evals_ch7)
get_regression_table(score_model_interaction)
## # A tibble: 4 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 4.88 0.205 23.8 0 4.48 5.29
## 2 age -0.018 0.004 -3.92 0 -0.026 -0.009
## 3 gendermale -0.446 0.265 -1.68 0.094 -0.968 0.076
## 4 age:gendermale 0.014 0.006 2.45 0.015 0.003 0.024
The modeling equation for this scenario is (Writing the equation):
\(\hat{y} = b_0 + b_1 \cdot x_1 + b_2 \cdot x_2 + b_3 \cdot x_1 \cdot x_2\)
\(\hat{score} = 4.883 - 0.018 \cdot age - 0.446 \cdot 1_{Male}(x) + 0.014 \cdot age \cdot 1_{Male}(x)\)
The model for male:
\(\hat{score} = 4.883 - 0.018 \cdot age - 0.446 \cdot 1 + 0.014 \cdot age \cdot 1\)
\(\hat{score} = 4.437 - 0.004 \cdot age\)
The model for female:
\(\hat{score} = 4.883 - 0.018 \cdot age - 0.446 \cdot 0 + 0.014 \cdot age \cdot 0\)
\(\hat{score} = 4.883 - 0.018 \cdot age\)
We see that while male instructors have a lower intercept, as they age, they have a less steep associated average decrease in teaching scores: 0.004 teaching score units per year as opposed to 0.018 for women. This is consistent with the different slopes and intercepts of the red and blue regression lines fit in the original scatterplot.
ggplot(data = evals_ch7, mapping = aes(x = age, y = score, color = gender)) +
geom_point(alpha = 0.5) +
labs(x = "Age", y = "Teaching Score", title = "Using the Interaction Model") +
geom_abline(intercept = 4.883, slope = -0.018, color = "tomato", lwd=1) +
geom_abline(intercept = 4.437, slope = -0.004, color = "mediumturquoise", lwd=1)
A bowl has a certain number of red and a certain number of white balls all of equal size. Furthermore, it appears the bowl has been mixed beforehand as there does not seem to be any particular pattern to the spatial distribution of red and white balls.
What proportion of this bowl’s balls are red?
One way to answer this question would be to perform an exhaustive count: remove each ball individually, count the number of red balls and the number of white balls, and divide the number of red balls by the total number of balls. However, this would be a long and tedious process.
Observe that 17 of the balls are red and there are a total of 50 balls and thus 34% of the shovel’s balls are red. We can view the proportion of balls that are red in this shovel as a guess of the proportion of balls that are red in the entire bowl. While not as exact as doing an exhaustive count, our guess of 34% took much less time and energy to obtain.
However, say, we started this activity over from the beginning. In other words, we replace the 50 balls back into the bowl and start over. Would we remove exactly 17 red balls again? In other words, would our guess at the proportion of the bowl’s balls that are red be exactly 34% again? Maybe?
What if we repeated this exercise several times? Would we obtain exactly 17 red balls each time? In other words, would our guess at the proportion of the bowl’s balls that are red be exactly 34% every time? Surely not.
library(moderndive)
data(bowl)
head(bowl)
## # A tibble: 6 x 2
## ball_ID color
## <int> <chr>
## 1 1 white
## 2 2 white
## 3 3 white
## 4 4 red
## 5 5 white
## 6 6 white
Observe in the output that bowl has 2400 rows, telling us that the bowl contains 2400 equally-sized balls. The first variable ball_ID is used merely as an identification variable, none of the balls in the actual bowl are marked with numbers. The second variable color indicates whether a particular virtual ball is red or white.
Now that we have a virtual analogue of our bowl, we now need a virtual analogue for the shovel seen in Figure 2; we’ll use this virtual shovel to generate our virtual random samples of 50 balls. We’re going to use the rep_sample_n() function included in the moderndive package. This function allows us to take repeated, or replicated, samples of size \(n\). Run the following and explore.
virtual_shovel <- bowl %>%
rep_sample_n(size = 50)
virtual_shovel
## # A tibble: 50 x 3
## # Groups: replicate [1]
## replicate ball_ID color
## <int> <int> <chr>
## 1 1 2064 white
## 2 1 1019 white
## 3 1 601 red
## 4 1 2371 white
## 5 1 327 white
## 6 1 84 white
## 7 1 1801 white
## 8 1 1036 white
## 9 1 1257 white
## 10 1 2274 red
## # … with 40 more rows
Next we can find out how many red ones are there in our virtual_shovel
virtual_shovel %>%
summarize(num_red = sum(color=="red"))
## # A tibble: 1 x 2
## replicate num_red
## <int> <int>
## 1 1 14
How about the proportion of red? We can use the mutate() function (new to us) to create a new variable, in this case prop_red.
virtual_shovel %>%
summarize(num_red = sum(color == "red")) %>%
mutate(prop_red = num_red / 50)
## # A tibble: 1 x 3
## replicate num_red prop_red
## <int> <int> <dbl>
## 1 1 14 0.28
virtual_samples <- bowl %>%
rep_sample_n(size = 50, reps = 30)
#kable(virtual_samples)
Observe that while the first 50 rows of replicate are equal to 1, the next 50 rows of replicate are equal to 2. This is telling us that the first 50 rows correspond to the first sample of 50 balls while the next 50 correspond to the second sample of 50 balls. This pattern continues for all reps = 30 replicates and thus virtual_samples has \(30×50=1500\) rows.
virtual_prop_red <- virtual_samples %>%
group_by(replicate) %>%
summarize(red = sum(color == "red")) %>%
mutate(prop_red = red / 50)
virtual_prop_red
## # A tibble: 30 x 3
## replicate red prop_red
## <int> <int> <dbl>
## 1 1 15 0.3
## 2 2 12 0.24
## 3 3 23 0.46
## 4 4 11 0.22
## 5 5 23 0.46
## 6 6 25 0.5
## 7 7 22 0.44
## 8 8 18 0.36
## 9 9 17 0.34
## 10 10 22 0.44
## # … with 20 more rows
Let’s visualize the distribution of these 30 proportions red based on 30 virtual samples using a histogram with binwidth = 0.05
ggplot(virtual_prop_red, aes(x = prop_red)) +
geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white", fill = "steelblue") +
labs(x = "Proportion of 50 balls that were red",
title = "Distribution of 30 proportions red")
Observe that occasionally we obtained proportions red that are less than ____, while on the other hand we occasionally we obtained proportions that are greater than ____. However, the most frequently occurring proportions red out of 50 balls were between ____ % and ____ % (for ___ out 30 samples). Why do we have these differences in proportions red? Because of ___________________.
Redo the above activity with 1000 repeated samples and state your conclusions.
virtual_samples <- bowl %>%
rep_sample_n(size = 50, reps = 1000)
virtual_prop_red <- virtual_samples %>%
group_by(replicate) %>%
summarize(red = sum(color == "red")) %>%
mutate(prop_red = red / 50)
ggplot(virtual_prop_red, aes(x = prop_red)) +
geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white", fill = "steelblue") +
labs(x = "Proportion of 50 balls that were red",
title = "Distribution of 1000 proportions red")
# Segment 1: sample size = 25 ------------------------------
# 1.a) Virtually use shovel 1000 times
virtual_samples_25 <- bowl %>%
rep_sample_n(size = 25, reps = 1000)
# 1.b) Compute resulting 1000 replicates of proportion red
virtual_prop_red_25 <- virtual_samples_25 %>%
group_by(replicate) %>%
summarize(red = sum(color == "red")) %>%
mutate(prop_red = red / 25)
# 1.c) Plot distribution via a histogram
p1 <- ggplot(virtual_prop_red_25, aes(x = prop_red)) +
geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
labs(x = "Proportion of 25 balls that were red", title = "25")
# Segment 2: sample size = 50 ------------------------------
# 2.a) Virtually use shovel 1000 times
virtual_samples_50 <- bowl %>%
rep_sample_n(size = 50, reps = 1000)
# 2.b) Compute resulting 1000 replicates of proportion red
virtual_prop_red_50 <- virtual_samples_50 %>%
group_by(replicate) %>%
summarize(red = sum(color == "red")) %>%
mutate(prop_red = red / 50)
# 2.c) Plot distribution via a histogram
p2 <- ggplot(virtual_prop_red_50, aes(x = prop_red)) +
geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
labs(x = "Proportion of 50 balls that were red", title = "50")
# Segment 3: sample size = 100 ------------------------------
# 3.a) Virtually using shovel with 100 slots 1000 times
virtual_samples_100 <- bowl %>%
rep_sample_n(size = 100, reps = 1000)
# 3.b) Compute resulting 1000 replicates of proportion red
virtual_prop_red_100 <- virtual_samples_100 %>%
group_by(replicate) %>%
summarize(red = sum(color == "red")) %>%
mutate(prop_red = red / 100)
# 3.c) Plot distribution via a histogram
p3 <- ggplot(virtual_prop_red_100, aes(x = prop_red)) +
geom_histogram(binwidth = 0.05, boundary = 0.4, color = "white") +
labs(x = "Proportion of 100 balls that were red", title = "100")
library(cowplot) # plot_grid() comes from the cowplot package
plot_grid(p1, p2, p3, nrow = 1)
Observe that as the sample size increases, the ______ of the 1000 replicates of the proportion red decreases. In other words, as the sample size increases, there are less differences due to sampling variation and the distribution centers more tightly around the same value. Eyeballing the above Figure, things appear to center tightly around roughly ____%.
# n = 25
virtual_prop_red_25 %>%
summarize(sd = sd(prop_red))
## # A tibble: 1 x 1
## sd
## <dbl>
## 1 0.0946
# n = 50
virtual_prop_red_50 %>%
summarize(sd = sd(prop_red))
## # A tibble: 1 x 1
## sd
## <dbl>
## 1 0.0686
# n = 100
virtual_prop_red_100 %>%
summarize(sd = sd(prop_red))
## # A tibble: 1 x 1
## sd
## <dbl>
## 1 0.0470
As the sample size increases, our numerical measure of spread decreases; there is less variation in our proportions red. In other words, as the sample size increases, our guesses at the true proportion of the bowl’s balls that are red get more consistent and precise.
This was our first attempt at understanding two key concepts relating to sampling for estimation:
Putting it all together…
Definition 1.1: The sampling distribution of a Statistic (e.g. Mean, Median, Proportion, etc) is its probability distribution.
Definition 1.2: The standard deviation of a sampling distribution is called the standard error.
Find and plot the sampling distribution of the proportion (\(\hat{p}\)) of heads when you flip a fair coin. (Use 5000 sets of 10 tosses)
coin = c("Heads", "Tails")
coin = tbl_df(coin)
coin
## # A tibble: 2 x 1
## value
## <chr>
## 1 Heads
## 2 Tails
vs = coin %>% rep_sample_n(size = 10, reps = 5000, replace = TRUE)
vp = vs %>%
group_by(replicate) %>%
summarize(Heads = sum(value == "Heads")) %>%
mutate(pHeads = Heads / 10)
ggplot(vp, aes(x = pHeads)) +
geom_histogram(binwidth = 0.1, color = "white", fill = "brown") +
labs(x = "Proportion of 10 flipped heads",
title = "Distribution of 5000 proportions heads")
mean(vp$pHeads)
## [1] 0.4998
sd(vp$pHeads)
## [1] 0.1600784
The normal distribution is defined by the following probability density function, where \(\mu\) is the population mean and \(\sigma\) is the standard deviation.
\(f(x) = \dfrac{1}{\sigma \sqrt{2 \pi}}e^{-(x-\mu)^2/{2\sigma^2}}\)
If a random variable \(X\) follows the normal distribution, then we write: \(X \sim N(\mu, \sigma^2)\)
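As a sanity check, R's dnorm() implements exactly this density; a sketch comparing the formula to dnorm() at one point:

```r
# Evaluate the normal density two ways and confirm they agree
mu <- 15; sigma <- 3; x <- 18
by_formula <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
by_dnorm   <- dnorm(x, mean = mu, sd = sigma)
all.equal(by_formula, by_dnorm)  # TRUE
```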
Here is what the normal density looks like. Ex: here \(X \sim N(0, 1)\)
Package: stats
x <- rnorm(n = 100, mean = 15, sd = 3)
ggplot(data.frame(x), aes(x = x)) + geom_histogram(binwidth = 1.5)
ggplot(data.frame(x), aes(x = x)) + geom_histogram(binwidth = 1.5) + geom_vline(xintercept = 21)
pnorm(21, mean = 15, sd = 3) # pnorm gives us the left tail area to a given number, 21 in this case
## [1] 0.9772499
qnorm(.25, mean = 15, sd = 3)
## [1] 12.97653
1 - pnorm(100, mean = 90, sd = sqrt(10))
## [1] 0.0007827011
qnorm(.95, mean = 527, sd = 112)
## [1] 711.2236
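pnorm() and qnorm() are inverses of each other: pnorm() maps a value to its left-tail area, and qnorm() maps a left-tail area back to the value. A quick sketch:

```r
# qnorm() undoes pnorm() (and vice versa) for the same mean and sd
q <- qnorm(0.95, mean = 527, sd = 112)  # value with 95% of the area to its left
pnorm(q, mean = 527, sd = 112)          # recovers 0.95
```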
The Exponential Distribution is defined by the following probability density function, where \(\dfrac{1}{\lambda}\) is both the population mean and the standard deviation.
\(f(x) = \lambda e^{-\lambda x}\)
If a random variable \(X\) follows the Exponential Distribution, then we write: \(X \sim Exp(\lambda)\)
Here is what the Exponential density looks like. Ex: here \(X \sim Exp(1/15)\)
Package: stats
x <- rexp(n = 100, rate = 1/15)
ggplot(data.frame(x), aes(x = x)) + geom_histogram(binwidth = 5)
pexp(21, rate = 1/15)
## [1] 0.753403
qexp(.75, rate = 1/15)
## [1] 20.79442
The number of days ahead travelers purchase their airline tickets can be modeled by an exponential distribution with the average amount of time equal to 15 days.
\(X \sim Exp(1/15)\)
pexp(10, rate = 1/15)
## [1] 0.4865829
qexp(.2, rate = 1/15)
## [1] 3.347153
The binomial distribution is a discrete probability distribution. It describes the outcome of n independent trials in an experiment. Each trial is assumed to have only two outcomes, either success or failure. If the probability of a successful trial is \(p\), then the probability of having \(x\) successful outcomes in an experiment of \(n\) independent trials is as follows:
\(f(x) = {n \choose x} p^x (1-p)^{(n-x)} \quad \text{where x = 0, 1, 2,...,n}\)
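R's dbinom() implements this mass function, and choose() gives the binomial coefficient \({n \choose x}\). A sketch verifying the two agree:

```r
# Evaluate the binomial pmf two ways and confirm they agree
n <- 12; x <- 4; p <- 0.2
by_formula <- choose(n, x) * p^x * (1 - p)^(n - x)
by_dbinom  <- dbinom(x, size = n, prob = p)
all.equal(by_formula, by_dbinom)  # TRUE
```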
Example: Suppose there are twelve multiple choice questions in an English class quiz. Each question has five possible answers, and only one of them is correct. What is the probability that a student who guesses at random gets exactly four correct answers?
\({12 \choose 4} 0.2^4 (1-0.2)^{(12-4)} = 0.1329\)
dbinom(4, size=12, prob=0.2)
## [1] 0.1328756
The probability of getting four or fewer correct answers:
\({12 \choose 4} 0.2^4 (1-0.2)^{(12-4)} + {12 \choose 3} 0.2^3 (1-0.2)^{(12-3)} + {12 \choose 2} 0.2^2 (1-0.2)^{(12-2)} + {12 \choose 1} 0.2^1 (1-0.2)^{(12-1)} + {12 \choose 0} 0.2^0 (1-0.2)^{(12-0)} = 0.9274\)
xs = c(0:4)
sum = 0
for (x in xs) {
sum = sum + dbinom(x, size=12, prob=0.2)
}
sum
## [1] 0.9274445
# Equivalently, the built-in pbinom() gives this cumulative probability: pbinom(4, size = 12, prob = 0.2)
Sampling distribution of \(\hat{p}\) is normal with mean \(p\) (Actual population proportion) and standard deviation of \(\sqrt{p(1-p)/n}\)
\(\hat{p} \sim N( p, \sqrt{p(1-p)/n} )\)
Let’s revisit the coin flip example. Check how close our simulation results are to the theoretical results.
mean(vp$pHeads)
## [1] 0.4998
sd(vp$pHeads)
## [1] 0.1600784
Now let’s find the theoretical mean and the standard deviation the sampling distribution of the proportion:
Here \(p=0.5, n=10\)
\(\hat{p} \sim N( p, \sqrt{p(1-p)/n} ) = N( 0.5, \sqrt{0.5(1-0.5)/10} ) = N( 0.5, 0.1581139)\)
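The theoretical standard error can be computed directly in R and compared with the simulated value of 0.1600784 above:

```r
# Theoretical standard error of p-hat for a fair coin, n = 10 flips
p <- 0.5; n <- 10
se_theoretical <- sqrt(p * (1 - p) / n)
se_theoretical  # 0.1581139, close to the simulated 0.16
```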
Xbars = numeric(1000)
for (i in 1:1000) {
x = rnorm(n = 100, mean = 10, sd = 3)
Xbars[i] = mean(x)
}
ggplot(data.frame(Xbars), aes(x = Xbars)) + geom_histogram(bins = 20, color = "white", fill = "steelblue")
mean(Xbars)
## [1] 9.984861
sd(Xbars)
## [1] 0.3082463
Xbars2 = numeric(1000)
for (i in 1:1000) {
x = rexp(n = 100, rate = 1/15)
Xbars2[i] = mean(x)
}
ggplot(data.frame(Xbars2), aes(x = Xbars2)) + geom_histogram(bins = 20, color = "white", fill = "steelblue")
mean(Xbars2)
## [1] 15.0315
sd(Xbars2)
## [1] 1.526903
For a large sample size (rule of thumb: \(n≥30\)), the sample mean \(\bar{X}\) is approximately normally distributed, regardless of the distribution of the population one samples from. If the population has mean \(\mu\) and standard deviation \(\sigma\), then \(\bar{X}\) has mean \(\mu\) and standard error \(\sigma/ \sqrt{n}\)
Therefore, according to the CLT: \(\bar{X} \sim N(\mu, \sigma/ \sqrt{n})\)
Note that the spread of the sampling distribution of the mean decreases as the sample size increases.
Note: In this class, we only have theoretical sampling distributions for \(\hat{p}\) and \(\bar{X}\). For other sample statistics like the median, min, max, etc., we only have the simulated results.
Finding probabilities using a sampling distribution.
The engines made by Ford for speedboats had an average power of 220 horsepower (HP) and standard deviation of 15 HP.
A potential buyer intends to take a sample of four engines and will not place an order if the sample mean is less than 215 HP. What is the probability that the buyer will not place an order? Assume that the parent distribution is Normal.
\(Pr(\bar{X}<215)= ?\)
\(\bar{X} \sim N(\mu, \dfrac{\sigma}{\sqrt{n}})\)
\(\bar{X} \sim N(220, \dfrac{15}{\sqrt{4}})\)
\(\bar{X} \sim N(220, 7.5)\)
pnorm(215, mean = 220, sd = 7.5)
## [1] 0.2524925
Xbars = numeric(1000)
for (i in 1:1000) {
x = rnorm(n = 4, mean = 220, sd = 15)
Xbars[i] = mean(x)
}
mean(Xbars < 215)
## [1] 0.236
ggplot(data.frame(Xbars), aes(x=Xbars)) + geom_histogram(color = "white", fill = "blue4", bins = 15)
A friend claims that she has drawn a random sample of size 30 from the exponential distribution \((f(x) = \lambda e^{-\lambda x})\) with \(\lambda = 1/10\)
The mean of her sample is 12.
#10
sims = 100000
Xbars = numeric(sims)
for (i in 1:sims) {
x = rexp(n = 30, rate = 1/10)
Xbars[i] = mean(x)
}
mean(Xbars > 12)
## [1] 0.13799
ggplot(data.frame(x = Xbars), aes(x = x, fill = x >= 12)) +
geom_histogram(binwidth = 0.1) +
theme_bw() +
labs(x = expression(bar(x))) +
geom_vline(xintercept = 12, linetype = "dashed") +
scale_fill_manual(values = c("grey", "black"))
Is a mean of 12 unusual for a sample of size 30 from \(\text{Exp}(\lambda = 1/10)\)?
Note: Since a mean of 12 or greater appears about \(13.8\%\) of the time, it is not considered unusual.
Consider the population {3, 5, 6, 6, 8, 11, 13, 15, 19, 20}.
v = c(3, 5, 6, 6, 8, 11, 13, 15, 19, 20)
mean(v)
## [1] 10.6
sd(v)
## [1] 5.985167
ggplot(data.frame(v), aes(x=v)) + geom_histogram(color = "grey1", fill = "grey3", binwidth = 3)
sims = 100000
Xbars = numeric(sims)
for (i in 1:sims) {
x = sample(v, size = 4)
Xbars[i] = mean(x)
}
mean(Xbars)
## [1] 10.5996
sd(Xbars)
## [1] 2.324477
ggplot(data.frame(x = Xbars), aes(x = x, fill = x >= 12)) +
geom_histogram(binwidth = 0.5) +
theme_bw() +
labs(x = expression(bar(x))) +
geom_vline(xintercept = 12, linetype = "dashed") +
scale_fill_manual(values = c("grey", "black"))
mean(Xbars < 11)
## [1] 0.54214
Suppose the heights of boys in a certain city follow a normal distribution with mean 48 inches and standard deviation 9 inches. What is the probability that the mean height of a random sample of 30 boys exceeds 51 inches?
1 - pnorm(51, mean = 48, sd = 9/sqrt(30))
## [1] 0.03394458
sims = 100000
Xbars = numeric(sims)
for (i in 1:sims) {
x = rnorm(n = 30, mean = 48, sd = 9)
Xbars[i] = mean(x)
}
mean(Xbars > 51)
## [1] 0.03373
The engines made by Ford for speedboats had an average power of 220 horsepower (HP) and standard deviation of 15 HP.
A potential buyer intends to take a sample of four engines and will not place an order if the sample median is less than 215 HP. What is the probability that the buyer will not place an order? Assume that the parent distribution is Normal.
NOT TAUGHT IN THIS CLASS
Medians = numeric(1000)
for (i in 1:1000) {
x = rnorm(n = 4, mean = 220, sd = 15)
Medians[i] = median(x)
}
mean(Medians < 215)
## [1] 0.28
The exact sampling distribution of \(W\) is:
\(N\left(mean = \mu_1 + \mu_2, sd = \sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}\right)\)
\(N\left(20 + 16, \sqrt{8^2/10 + 7^2/15}\right) = N(36, 3.11)\)
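The theoretical mean and standard error can be evaluated in R before simulating:

```r
# Mean and standard error of W from the exact sampling distribution
mu <- 20 + 16                    # mean of W
se <- sqrt(8^2 / 10 + 7^2 / 15)  # standard error of W
c(mu, se)  # 36 and about 3.11
```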
Simulate the sampling distribution of \(W\) in R and plot your results. Check that the simulated mean and standard error are close to the theoretical mean and standard error.
sims = 100000
Wbars = numeric(sims)
for (i in 1:sims) {
x = rnorm(n = 10, mean = 20, sd = 8)
y = rnorm(n = 15, mean = 16, sd = 7)
Wbars[i] = mean(x) + mean(y)
}
mean(Wbars)
## [1] 35.99492
sd(Wbars)
## [1] 3.103429
\(\widehat{P(W < 40)} = 0.9011\)
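The probability \(P(W < 40)\) can be found from the theoretical normal distribution as well as from the simulation; a sketch of the theoretical side:

```r
# Theoretical: W ~ N(36, 3.11)
pnorm(40, mean = 36, sd = sqrt(8^2 / 10 + 7^2 / 15))  # about 0.901

# The simulated counterpart would be mean(Wbars < 40),
# assuming Wbars from the loop above is still in the workspace
```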
The data set Recidivism contains the population of all Iowa offenders convicted of either a felony or misdemeanor who were released in 2010 (Case study in Section 1.4) Of these, 31.6% recidivated and were sent back to prison. Simulate the sampling distribution of \(\hat{p}\), the sample proportion of offenders who recidivated, for random samples of size 25.
library(resampledata)
data("Recidivism")
sims = 100000
Xbars = numeric(sims)
for (i in 1:sims) {
x = sample(Recidivism$Recid, size = 25)
Xbars[i] = mean(x == "Yes")
}
mean(Xbars)
## [1] 0.316326
sd(Xbars)
## [1] 0.09327506
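We can compare the simulated standard error with the theoretical one from the sampling distribution of \(\hat{p}\); a sketch using \(p = 0.316\) and \(n = 25\):

```r
# Theoretical standard error of p-hat with p = 0.316, n = 25
p <- 0.316; n <- 25
sqrt(p * (1 - p) / n)  # about 0.093, close to the simulated 0.0933
```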
ggplot(data.frame(x = Xbars), aes(x = x)) +
geom_histogram(binwidth = 0.04) +
theme_bw() +
labs(x = expression(bar(x)))
set.seed(888)
v = c(3, 6, 7, 9, 11, 14)
sims = 100000
Xbars = numeric(sims)
for (i in 1:sims) {
x = sample(v, size = 3, replace = FALSE)
Xbars[i] = min(x)
}
# Plot of the sampling distribution...
ggplot(data.frame(x = Xbars), aes(x = x)) +
geom_histogram(bins = 10, color = "white", fill = "steelblue") +
theme_bw()
# Mean of the sampling distribution...
mean(Xbars)
## [1] 4.80477
The mean of the simulated sampling distribution of the minimum is \(4.80477\).
Note: For question 2, you need to know what the Uniform distribution is. (We did not talk about this distribution in class.) As far as R code goes, knowing how to generate a random sample from a uniform distribution in R is enough.
set.seed(888)
sims = 100000
Wbars = numeric(sims)
for (i in 1:sims) {
x = runif(n = 1, min = 40, max = 60)
y = runif(n = 1, min = 45, max = 80)
Wbars[i] = x + y
}
ggplot(data.frame(x = Wbars), aes(x = x)) +
geom_histogram(bins = 20, color = "white", fill = "steelblue") +
theme_bw()
mean(Wbars)
## [1] 112.4866
sd(Wbars)^2
## [1] 135.3673
The sampling distribution is skewed right with a mean of \(112.4866\) and a variance of \(135.3673\)
mean(Wbars < 90)
## [1] 0.01796
\(\widehat{P(X+Y < 90)} = 0.01796\)
set.seed(888)
pnorm(5, mean = 8, sd = 8/sqrt(10))
## [1] 0.11784
\(P(\bar{X} < 5) \approx 0.11784\) (normal approximation)
Although the normal-approximation probability and our simulated probability are similar, the CLT does not strictly apply to the problem above as \(n < 30\). Therefore, we cannot assume that the sampling distribution is normal.
set.seed(888)
sims = 100000
Xbars = numeric(sims)
for (i in 1:sims) {
x = rexp(n = 10, rate = 1/8)
Xbars[i] = mean(x)
}
mean(Xbars < 5)
## [1] 0.10225
\(\widehat{P(\bar{X} < 5)} = 0.10225\)
Using R, draw random samples (without replacement) of size 3 from each population, and simulate the sampling distribution of the sum of their maxima. Describe the distribution.
set.seed(888)
A = c(3, 5, 7, 9, 10, 16)
B = c(8, 10, 11, 15, 18, 25, 28)
sims = 100000
Wbars = numeric(sims)
for (i in 1:sims) {
x = sample(A, size = 3, replace = FALSE)
y = sample(B, size = 3, replace = FALSE)
Wbars[i] = max(x) + max(y)
}
mean(Wbars)
## [1] 36.51566
sd(Wbars)
## [1] 6.013892
ggplot(data.frame(x = Wbars), aes(x = x)) +
geom_histogram(bins = 20, color = "white", fill = "steelblue") +
theme_bw()
The distribution is skewed left with a mean of \(36.51566\) and a standard error of \(6.013892\)
mean(Wbars < 20)
## [1] 0.00145
\(\widehat{P(\max(x) + \max(y) < 20)} = 0.00145\)
In R, max(union(a, b)) returns the maximum of the union of sets a and b.
set.seed(888)
A = c(3, 5, 7, 9, 10, 16)
B = c(8, 10, 11, 15, 18, 25, 28)
sims = 100000
Wbars = numeric(sims)
for (i in 1:sims) {
a = sample(A, size = 3, replace = FALSE)
b = sample(B, size = 3, replace = FALSE)
Wbars[i] = max(union(a, b))
}
ggplot(data.frame(x = Wbars), aes(x = x)) +
geom_histogram(bins = 20, color = "white", fill = "steelblue") +
theme_bw()
Similar skewed-left distribution as before, but with a lower mean and fewer unique values.
mean(Wbars < 20)
## [1] 0.28522
\(\widehat{P(\max(a \cup b) < 20)} = 0.28522\)
FishMercury contains mercury levels (parts per million) for 30 fish caught in lakes in Minnesota.
ggplot(data = FishMercury, aes(x = Mercury)) +
geom_histogram(bins = 20, fill = "steelblue", color = "white")
set.seed(888)
n <- length(FishMercury$Mercury) # This is the original sample size
B <- 10000 # The number of bootstrap samples
boot_Mean <- numeric(B) # A vector to store bootstrap means from the bootstrap samples
for (i in 1:B){
x <- sample(FishMercury$Mercury, size = n, replace = TRUE) # Here n is the size of your bootstrap sample
boot_Mean[i] <- mean(x)
}
mean(boot_Mean) # This is the mean of the bootstrap means - so the center of the bootstrap distribution
## [1] 0.1819197
sd(boot_Mean) # This is the standard error of the bootstrap distribution
## [1] 0.05775204
quantile(boot_Mean, probs = c(0.025,0.975))
## 2.5% 97.5%
## 0.1120000 0.3069692
\(\bar{X} = 0.1819197\)
\(\sigma_X = 0.05775204\)
We are 95% confident that the mean mercury concentration in fish is between \(0.112\) and \(0.307\) parts per million.
FishMercuryNoOutlier = FishMercury[-c(1),]
head(FishMercuryNoOutlier)
## [1] 0.160 0.088 0.160 0.145 0.099 0.101
ggplot(data = data.frame(FishMercuryNoOutlier), aes(x = FishMercuryNoOutlier)) +
geom_histogram(bins = 10, fill = "steelblue", color = "white")
set.seed(888)
n <- length(FishMercuryNoOutlier) # This is the original sample size
B <- 10000 # The number of bootstrap samples
boot_Mean <- numeric(B) # A vector to store bootstrap means from the bootstrap samples
for (i in 1:B){
x <- sample(FishMercuryNoOutlier, size = n, replace = TRUE) # Here n is the size of your bootstrap sample
boot_Mean[i] <- mean(x)
}
mean(boot_Mean) # This is the mean of the bootstrap means - so the center of the bootstrap distribution
## [1] 0.123513
sd(boot_Mean) # This is the standard error of the bootstrap distribution
## [1] 0.007774483
quantile(boot_Mean, probs = c(0.025,0.975))
## 2.5% 97.5%
## 0.1079655 0.1385172
\(\bar{X} = 0.123513\)
\(\sigma_X = 0.007774483\)
We are 95% confident that the mean mercury concentration in fish without outliers is between the values of \(0.1079655\) and \(0.1385172\).
By removing the outlier from the distribution, the mean decreased and the standard error decreased dramatically (\(0.05775204\) with the outlier and \(0.007774483\) without).
BeerwingsMale = Beerwings %>% filter(Gender == "M")
BeerwingsFemale = Beerwings %>% filter(Gender == "F")
set.seed(888)
nm = nrow(BeerwingsMale)
nf = nrow(BeerwingsFemale)
B = 10000
boot_Mean_Diff = numeric(B)
for (i in 1:B) {
m <- sample(BeerwingsMale$Hotwings, size = nm, replace = TRUE)
f <- sample(BeerwingsFemale$Hotwings, size = nf, replace = TRUE)
boot_Mean_Diff[i] <- mean(m) - mean(f) #male - female
}
ggplot(data = data.frame(boot_Mean_Diff), aes(x = boot_Mean_Diff)) +
geom_histogram(bins = 20, fill = "orange", color = "grey20")
mean(boot_Mean_Diff)
## [1] 5.19644
sd(boot_Mean_Diff)
## [1] 1.448123
The bootstrap distribution of the difference of the means is approximately normal with a mean of \(5.19644\) and a standard error of \(1.448123\).
quantile(boot_Mean_Diff, probs = c(0.025,0.975))
## 2.5% 97.5%
## 2.333333 8.000000
We are 95% confident that men consume, on average, between \(2.33\) and \(8\) more hotwings than women.
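Since the bootstrap distribution looks approximately normal, a normal-approximation interval (bootstrap mean ± 1.96 standard errors) should land close to the percentile interval above; a sketch using the values reported:

```r
# Normal-approximation bootstrap CI from the summary values above
boot_mean <- 5.19644
boot_se   <- 1.448123
boot_mean + c(-1.96, 1.96) * boot_se  # about (2.36, 8.03), close to (2.33, 8.00)
```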
Consider the data set Girls2004 (see Section 1.2). Use skim() to obtain summary statistics on the weight of baby girls born in Wyoming and Alaska.
Girls2004 %>% skim()
| Name | Piped data |
| Number of rows | 80 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| factor | 3 |
| numeric | 3 |
| ________________________ | |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| State | 0 | 1 | FALSE | 2 | AK: 40, WY: 40 |
| MothersAge | 0 | 1 | FALSE | 6 | 20-: 29, 25-: 22, 30-: 15, 15-: 6 |
| Smoker | 0 | 1 | FALSE | 2 | No: 69, Yes: 11 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ID | 0 | 1 | 40.50 | 23.24 | 1 | 20.75 | 40.5 | 60.25 | 80 | ▇▇▇▇▇ |
| Weight | 0 | 1 | 3362.12 | 525.24 | 2182 | 3076.25 | 3331.0 | 3709.00 | 4592 | ▂▃▇▃▂ |
| Gestation | 0 | 1 | 39.14 | 1.22 | 37 | 38.00 | 39.0 | 40.00 | 42 | ▆▇▅▃▁ |
GirlsWY = Girls2004 %>% filter(State == "WY")
GirlsAK = Girls2004 %>% filter(State == "AK")
set.seed(888)
nwy = nrow(GirlsWY)
nak = nrow(GirlsAK)
B = 10000
boot_Mean_Diff = numeric(B)
for (i in 1:B) {
wy <- sample(GirlsWY$Weight, size = nwy, replace = TRUE)
ak <- sample(GirlsAK$Weight, size = nak, replace = TRUE)
boot_Mean_Diff[i] <- mean(wy) - mean(ak) #wyoming - alaska
}
ggplot(data = data.frame(boot_Mean_Diff), aes(x = boot_Mean_Diff)) +
geom_histogram(bins = 20, fill = "pink", color = "white")
mean(boot_Mean_Diff)
## [1] -307.6122
sd(boot_Mean_Diff)
## [1] 111.1971
quantile(boot_Mean_Diff, probs = c(0.025,0.975))
## 2.5% 97.5%
## -523.12688 -84.87063
The bootstrap distribution of the difference of the means is approximately normal with a mean of \(-307.6122\) and a standard error of \(111.1971\).
\(\bar{X} = -307.6122\)
\(\sigma_X = 111.1971\)
We are 95% confident that Wyoming girl babies are between \(84.87063\) and \(523.12688\) grams lighter than Alaska girl babies.
bootMeanDiff = mean(boot_Mean_Diff)
ogMeanDiff = (mean(GirlsWY$Weight) - mean(GirlsAK$Weight))
Bias = bootMeanDiff - ogMeanDiff
Bias
## [1] 0.8378375
\(Bias = -307.6122 - (-308.45) = 0.8378375\)
The data set IceCream contains calorie information for a sample of brands of chocolate and vanilla ice cream. Use the bootstrap to determine whether or not there is a difference in the mean number of calories.
Chocolate = IceCream$ChocolateCalories
Vanilla = IceCream$VanillaCalories
CalorieDiff = Chocolate - Vanilla
set.seed(888)
n = length(CalorieDiff)
B = 10000
boot_Mean_Diff = numeric(B)
for (i in 1:B) {
samp = sample(CalorieDiff, size = n, replace = TRUE)
boot_Mean_Diff[i] = mean(samp)
}
quantile(boot_Mean_Diff, probs = c(0.025,0.975))
## 2.5% 97.5%
## 3.435897 11.435897
We are 95% confident that the mean calories in chocolate ice cream is between \(3.435897\) and \(11.435897\) calories higher than the mean number of calories in vanilla ice cream.